Abstract

In the study of random structures we often face a trade-off between realism and tractability, the latter typically 
enabled by assuming some form of independence. In this work we initiate an effort to bridge this gap by developing 
tools that allow us to work with independence without assuming it. Let $\mathcal{G}_n$ be the set of all graphs on n vertices and
let S be an arbitrary subset of $\mathcal{G}_n$, e.g., the set of graphs with m edges. The study of random networks can be seen
as the study of properties that are true for most elements of S, i.e., that are true with high probability for a uniformly
random element of S. With this in mind, we pursue the following question: What are general sufficient conditions
for the uniform measure on a set of graphs $S \subseteq \mathcal{G}_n$ to be approximable by a product measure?


1 Introduction 

Since their introduction in 1959 by Erdős and Rényi and by Gilbert [3], respectively, G(n, m) and G(n, p) random
graphs have dominated the mathematical study of random networks. Given n vertices, G(n, m) selects uniformly
among all graphs with m edges, whereas G(n, p) includes each edge independently with probability p. A
refinement of G(n, m) is the family of graphs chosen uniformly among all graphs with a given degree sequence, a distribution
made tractable by the configuration model of Bollobás. Due to their mathematical tractability, these three models
have become a cornerstone of Probabilistic Combinatorics and have found application in the Analysis of Algorithms,
Coding Theory, Economics, Game Theory, and Statistical Physics.

This mathematical tractability stems from symmetry: the probability of each edge is either the same, as in G(n, p)
and G(n, m), or merely a function of the potency of its endpoints, as in the configuration model. This extreme
symmetry endows such graphs with numerous otherworldly properties such as near-optimal expansion. Perhaps most
importantly, it amounts to a complete lack of geometry, as manifested by the fact that the shortest path metric of such
graphs suffers maximal distortion when embedded in Euclidean space. In contrast, vertices of real networks are
typically embedded in some low-dimensional geometry, either explicit (physical networks) or implicit (social and
other latent semantics networks), with distance being a strong factor in determining the probability of edge formation.

While these shortcomings of the classical models have long been recognized, proposing more realistic models is
not an easy task. The difficulty lies in achieving a balance between realism and analytical tractability: it is only too
easy to create network models that are both ad hoc and intractable. By now there are thousands of papers proposing
different ways to generate graphs with desirable properties [9], and the vast majority of them only provide heuristic
arguments to back up their claims. For a gentle introduction the reader is referred to the book of Newman, and for
a more mathematical treatment to the books of Chung and Lu and of Durrett.

In trying to replicate real networks one approach is to keep adding features, creating increasingly complicated 
models, in the hope of matching observed properties. Ultimately, though, the purpose of any good model is prediction. 
In that sense, the reason to study random graphs with certain properties is to understand what other graph properties 
are typically implied by the assumed properties. For instance, the reason we study the uniform measure on graphs with
m edges, i.e., G(n, m), is to understand “what properties are typically implied by the property of having m edges”
(the answer cast as “properties that hold with high probability in a ‘random’ graph with m edges”). Notably, analyzing
the uniform measure even for this simplest property is non-trivial. The reason is that it entails the single massive
random choice of an m-subset of edges, rather than many independent choices. In contrast, the independence of edges
in G(n, p) makes the distribution far more accessible, greatly facilitating analysis.

It is a classic result of random graph theory that for $p = p(m) = m/\binom{n}{2}$, the random graph $G \sim G(n, m)$ and
the two random graphs $G^{\pm} \sim G(n, (1 \pm \epsilon)p)$ can be coupled so that, viewing each graph as a set of edges, with high
probability,

$$G^- \subseteq G \subseteq G^+ . \tag{1}$$

The significance of this relationship between what we wish to study (uniform measure) and what we can study (product
measure) cannot be overstated. It manifests most dramatically in the study of monotone properties: to study such
a property in $G \sim G(n, m)$, it suffices to consider $G^+$ and show that it typically does not have the property (negative
side), or $G^-$ and show that it typically does have the property (positive side). This connection has been thoroughly
exploited to establish threshold functions for a host of monotone graph properties such as Connectivity, Hamiltonicity,
and Subgraph Existence, making it the workhorse of random graph theory.
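The coupling behind (1) can be illustrated with a minimal sketch. It assumes the textbook construction in which every potential edge receives an independent uniform label: thresholding the labels yields the two product-measure graphs, while keeping the m smallest labels yields a uniform m-subset. The function name `sandwich_coupling` and all parameter values are hypothetical choices made for illustration, not taken from the paper.

```python
import random
from itertools import combinations

def sandwich_coupling(n, m, eps, rng):
    """Couple G(n, m) with G(n, (1 - eps) p) and G(n, (1 + eps) p), p = m / C(n, 2)."""
    edges = list(combinations(range(n), 2))
    p = m / len(edges)
    # One uniform label per potential edge drives all three graphs.
    label = {e: rng.random() for e in edges}
    g_minus = {e for e in edges if label[e] <= (1 - eps) * p}
    g_plus = {e for e in edges if label[e] <= (1 + eps) * p}
    # The m edges with the smallest labels form a uniformly random m-subset.
    g = set(sorted(edges, key=label.get)[:m])
    return g_minus, g, g_plus

rng = random.Random(0)
trials = 200
hits = 0
for _ in range(trials):
    g_minus, g, g_plus = sandwich_coupling(n=60, m=600, eps=0.2, rng=rng)
    hits += g_minus <= g <= g_plus  # set inclusion on both sides
print(f"sandwich held in {hits}/{trials} trials")
```

In this construction the sandwich holds exactly when the edge counts of the two product-measure graphs straddle m, so the failure probability is governed by binomial tail bounds.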

In this work we seek to extend the above relationship between the uniform measure and product measures to 
properties more delicate than having a given number of edges. In doing so we (i) provide a tool that can be used 
to revisit a number of questions in random graph theory from a more realistic angle and (ii) lay the foundation for 
designing random graph models eschewing independence assumptions. For example, our tool makes short work of the 
following set of questions (which germinated our work): 

Given an arbitrary collection of n points on the plane what can be said about the set of all graphs that can be built 
on them using a given amount of wire, i.e., when connecting two points consumes wire equal to their distance? What 
does a uniformly random such graph look like? How does it change as a function of the available wire? 

1.1 Our Contribution 

A product measure on the set $\mathcal{G}_n$ of all undirected simple graphs with n vertices is specified by a symmetric matrix
$Q \in [0,1]^{n \times n}$ where $Q_{ii} = 0$ for $i \in [n]$. By analogy to G(n, p), we denote by G(n, Q) the measure in which each
possible edge {i, j} is included independently with probability $Q_{ij} = Q_{ji}$. Let $S \subseteq \mathcal{G}_n$ be arbitrary. Our main result
is a sufficient condition for the uniform measure over S, denoted by U(S), to be approximable by a product measure
in the following sense.

Sandwichability. The measure U(S) is $(\epsilon, \delta)$-sandwichable if there exists an n × n symmetric matrix Q such that the
distributions $G^{\pm} \sim G(n, (1 \pm \epsilon)Q)$ and $G \sim U(S)$ can be coupled so that $G^- \subseteq G \subseteq G^+$ with probability at least
$1 - \delta$.

Informally, the two conditions required for our theorem to hold are as follows: 

Partition Symmetry. The set S should be symmetric with respect to some partition of the $\binom{n}{2}$ possible edges. That is,
the characteristic function of S can depend on how many edges are included from each part but not on which. The set
of all graphs with m edges satisfies this trivially: place all edges in one part and let the characteristic function be the
indicator that exactly m edges are included. Far more interestingly, in our motivating example edges are partitioned
into equivalence classes according to their length (distance of endpoints) and the characteristic function allows every
combination (vector) of numbers of edges from each part that does not violate the total wire budget. We discuss the
motivation for edge-partition symmetry at length in Section 2.

Convexity. Partition symmetry reduces the study of U(S) to the study of the induced measure on the set of all possible 
combinations (vectors) of numbers of edges from each part. As, in principle, this set can be arbitrary we must impose 
some regularity. We chose to do this by requiring this discrete set to be convex (more precisely that it contains all 
integral points in its convex hull). While convexity is actually not necessary for our proof method to work it provides a 
clean conceptual framework while allowing very general properties to be expressed. These include properties expressible
as Linear Programs over the number of edges from each part and even non-linear sets of constraints expressing
the presence or absence of percolation. Most importantly, since convex sets are closed under intersection, convex
properties can be composed (set intersection) while maintaining approximability by a product measure.

We state our results formally in Section 4. The general idea is this.

Theorem 1 (Informal). If S is a convex symmetric set with sufficient edge density, then U(S) is sandwichable by a
product measure G(n, Q*).

The matrix Q* is constructed by maximizing a concave function (entropy) over a convex domain (the convex hull
mentioned above). As a result, in many cases it can be computed explicitly, either analytically or numerically. Further,
it essentially characterizes the set S, as all quantitative requirements of our theorem are expressed only in terms of
Q*, the number k of parts in the partition, and the number n of vertices.

The proof of the theorem crucially relies on a new concentration inequality we develop for symmetric subsets of 
the binary cube and which, as we shall see, is sharp. Besides enabling the study of monotone properties, our results 
allow obtaining tight estimates of moments (expectation, variance, etc) of local features like subgraph counts. 

Outline of the Paper. In the next two sections we provide motivation (Section 2) and some example applications of
our theorem (Section 3). We state our results formally in Section 4 and provide a technical overview of the proofs in
Section 5. In Section 6, we discuss a connection of our work to Machine Learning and to Probabilistic Modeling in
general. The rest of the paper is devoted to the proofs. In Sections 7 and 8 we present the proofs of concentration and
sandwichability, respectively. Finally, Section 9 provides the proofs of some basic supplementary results.

2 Motivation 

As stated, our goal is to enable the study of the uniform measure over sets of graphs. The first step in this direction is 
to identify a “language” for specifying sets of graphs that is expressive enough to be interesting and restricted enough 
to be tractable. 

Arguably the most natural way to introduce structure on a set is to impose symmetry. This is formally expressed 
as the invariance of the set’s characteristic function under the action of a group of transformations. In this work, we 
explore the progress that can be made if we define an arbitrary partition of the edges and take the set of transformations 
to be the Cartesian product of all possible permutations of the edges (indices) within each part (symmetric group).
While our work is only a first step towards a theory of extracting independence from symmetry, we argue that symmetry 
with respect to an edge partition is well-motivated for two reasons. 

Existing Models. The first is that such symmetry, typically in a very rigid form, is already implicit in several random
graph models besides G(n, m). Among them are Stochastic Block Models (SBM), which assume the much stronger
property of symmetry with respect to a vertex partition, and Stochastic Kronecker Graphs. The fact that our notion
of symmetry encompasses SBMs is particularly pertinent in light of the theory of Graph Limits [16], since inherent
in the construction of the limiting object is an intermediate approximation of the sequence of graphs by a sequence of
SBMs, via the (weak) Szemerédi Regularity Lemma. Thus, any property that is encoded in the limiting object,
typically subgraph densities, is expressible within our framework.

Enabling the Expression of Geometry. A strong driving force behind the development of recent random graph 
models has been the incorporation of geometry, an extremely natural backdrop for network formation. Typically this 
is done by embedding the vertices in some (low-dimensional) metric space and assigning probabilities to edges as 
a function of length. Our work enables a far more light-handed approach to incorporating geometry by viewing it 
(i) as a symmetry rendering edges of the same length equivalent, while (ii) recognizing that it imposes macroscopic 
constraints on the set of feasible graphs. Most obviously, in a physical network where edges (wire, roads) correspond 
to a resource (copper, concrete) there is a bound on how much can be invested to create the network while, more 
generally, cost (length) may represent a number of different notions (e.g., class membership) that distinguish between 
edges. 

Perhaps the most significant feature of our work is that it fully supports the expression of geometry, by allowing 
the partition of edges into equivalence classes, without imposing any specific geometric requirement, i.e., without 
mandating the partition. 

Why Convexity? As mentioned, partition symmetry reduces the study of U(S) to the study of the induced distribution
on valid combinations (vectors) of numbers of edges from each part. Without any assumptions this set can be arbitrary,
e.g., S can be the set of graphs whose number of edges takes one of two widely separated values, rendering any
approximation by a product measure hopeless. To focus on the cases where an approximation would be meaningful we adopt the conceptually clear and
intuitive condition of convexity. In reality, convexity of the set is a proxy for the real property that we use in
the proof, that of approximate unimodality (Section 5).


3 Applications 

Given a partition, let $m_i(G)$ denote the number of edges from part i in a graph G, let $\mathbf{m}(G)$ denote the edge-profile of
G, and let $\mathbf{m}(S) = \{\mathbf{m}(G) : G \in S\}$ for $S \subseteq \mathcal{G}_n$.

Bounded Budget. Given a vertex set V and any cost function $c : V \times V \to \mathbb{R}_+$, partition the edges into equivalence
classes according to cost. Given a budget B, let $S(B) = \{G \in \mathcal{G}_n : \sum_{e \in E(G)} c(e) \le B\}$, i.e., the set of all graphs
feasible with budget B. Using the tools developed in this paper, we can study the uniform measure on S(B) and
show that the probability $Q^*_{uv}$ of an edge {u, v} with cost $c_{uv}$ is exponentially small in its cost, specifically
$Q^*_{uv} = [1 + \exp(\lambda(B) c_{uv})]^{-1}$, where $\lambda(B)$ is decreasing in B.
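As a rough numerical illustration of the logistic form above, the sketch below solves for the multiplier of a budget constraint by bisection. It makes several assumptions not stated in the text: all costs are positive, the multiplier is chosen so that the expected total cost under the product measure equals B, and when the budget is not binding the multiplier is 0 so that every edge has probability 1/2. The function names and the three-class example are hypothetical.

```python
import math

def edge_probabilities(costs, lam):
    """Probability 1 / (1 + exp(lam * c)) per cost class, written to avoid overflow."""
    # For lam * c >= 0, exp(-lam * c) <= 1, so the expression below is safe.
    return [math.exp(-lam * c) / (1.0 + math.exp(-lam * c)) for c in costs]

def expected_cost(costs, sizes, lam):
    """Expected total cost when each of the sizes[i] edges of class i has the above probability."""
    probs = edge_probabilities(costs, lam)
    return sum(p * c * q for p, c, q in zip(sizes, costs, probs))

def solve_lambda(costs, sizes, budget, tol=1e-12):
    """Find lam >= 0 whose expected cost meets the budget (bisection on a decreasing function)."""
    if expected_cost(costs, sizes, 0.0) <= budget:
        return 0.0  # budget not binding: every edge has probability 1/2
    lo, hi = 0.0, 1.0
    while expected_cost(costs, sizes, hi) > budget:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if expected_cost(costs, sizes, mid) > budget:
            lo = mid
        else:
            hi = mid
    return hi

costs, sizes = [1.0, 2.0, 3.0], [100, 100, 100]
for budget in (100.0, 150.0, 250.0):
    lam = solve_lambda(costs, sizes, budget)
    print(budget, round(lam, 4), [round(q, 4) for q in edge_probabilities(costs, lam)])
```

In this toy instance a larger budget yields a smaller multiplier, and within a fixed budget costlier edges receive smaller probabilities, in line with the statement that $\lambda(B)$ is decreasing in B.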

Linear Programs. Instead of a single cost for each edge and a single budget, edges can have multiple attributes, e.g.,
throughput, latency, etc. Grouping edges with identical attributes in one part, we can write arbitrary linear systems
whose variables are the components of the edge-profile $\mathbf{m}$, expressing capacity constraints, latency constraints, explicit
upper and lower bounds on the acceptable number of edges from a class, etc. Now, $S = \{G \in \mathcal{G}_n : A \cdot \mathbf{m}(G) \le \mathbf{b}\}$, for
some matrix $A = [A_1 \ldots A_k] \in \mathbb{R}^{\ell \times k}$ and vector $\mathbf{b} \in \mathbb{R}^{\ell}$. Besides generality of expression, the entropy optimization
problem defining Q* has a closed-form analytic solution in terms of the dual variables $\lambda \in \mathbb{R}^{\ell}_+$. The probability of an
edge {u, v} in part i is now given by $Q^*_{uv}(S) = [1 + \exp(A_i^\top \lambda)]^{-1}$. In Section 6 we show how this result can be
used to justify the assumptions of Logistic Regression.

Navigability. Kleinberg gave sufficient conditions for greedy routing to discover paths of poly-logarithmic
length between any two vertices in a graph. One of the most general settings where such navigability is possible
is set-systems, a mathematical abstraction of the relevant geometric properties of grids, regular trees and graphs of
bounded doubling dimension. The essence of navigability lies in the requirement that for any vertex in the graph, the
probability of having an edge to a vertex at distance in the range $[2^{i-1}, 2^i)$ is approximately uniform for all $i \in [\log n]$.

In our setting, we can partition the $\binom{n}{2}$ edges according to distance scale so that part $P_i$ includes all possible
edges between vertices at scale i. By considering graphs of bounded budget as above, in [12] we recover Kleinberg’s
results on navigability in set-systems, but without any independence assumptions regarding network formation, or
coordination between the vertices (such as using the same probability distribution). Besides establishing the robustness
of navigability, eschewing a specific mechanism for (navigable) network formation allows us to recast navigability as
a property of networks brought about by economical (budget) and technological (cost) advancements.

Percolation Avoidance. Consider a social network consisting of $\ell$ groups of sizes $\rho_i n$, where $\rho_i > 0$ for $i \in [\ell]$.
Imagine that we require a specific group s to act as the “connector”, i.e., that the graph induced by the remaining
groups should have no component of size greater than $\epsilon n$ for some arbitrarily small $\epsilon > 0$.

To study the uniform measure on the set of all such graphs $S = S_\epsilon$, as with SBMs, it is natural to partition the
possible edges into $\binom{\ell}{2}$ equivalence classes based on the communities of the endpoints. While the set is not symmetric
with respect to this partition, our result can still be useful in the following manner. Using a well-known connection
between Multitype Branching Processes and the existence of a Giant Component in mean-field models, like Erdős-Rényi
and Stochastic Block Models, we can recast the non-existence of the giant component in terms of a condition
on the number of edges between each pair of blocks, in the sense that, conditional on the number of edges, if the condition
holds then with high probability there is no giant component. Concretely, given the edge-profile M and a given cluster
$s \in [\ell]$, define the $(\ell - 1) \times (\ell - 1)$ matrix

$$T(M)_{ij} := \frac{m_{ij}}{n \rho_i}, \qquad i, j \in [\ell] \setminus \{s\},$$

that encapsulates the dynamics of a multi-type branching process. Let $\|\cdot\|_2$ denote the operator norm (maximum
singular value). A classic result of branching processes asserts that if $\|T(M)\|_2 < 1$ no giant component exists.
Thus, in our framework, the property $S_\epsilon$ = {no giant component without vertices from s} can be accurately
approximated, under the specific partition $\mathcal{P}$, by the set $\hat{S} = \{M : \|T(M)\|_2 < 1\}$, which moreover is convex, since
$\|T(M)\|_2$ is a convex function of M.


4 Definitions and Results 

For each convex symmetric set S we will identify an approximating product measure Q*(S) as the solution of a
constrained entropy-maximization problem. A precise version of our theorem requires the definition of parameters
capturing the geometry of the convex domain along with certain aspects of the edge-partition. As we will see, our
results are sharp with respect to these parameters. Below we give a series of definitions concluding with a formal
statement of our main result.

We start with some notation. We will use lower case boldface letters to denote vectors and uppercase boldface
letters to denote matrices. Further, we fix an arbitrary enumeration of the $N = \binom{n}{2}$ edges and sometimes represent the
set of all graphs on n vertices as $H_N = \{0,1\}^N$. We will refer to an element $x \in H_N$ interchangeably as a graph
and a string. Given a partition $\mathcal{P} = (P_1, \ldots, P_k)$ of [N], we define $\Pi_N(\mathcal{P})$ to be the set of all permutations acting
only within blocks of the partition.

Edge Block Symmetry. Fix a partition $\mathcal{P}$ of [N]. A set $S \subseteq H_N$ is called $\mathcal{P}$-symmetric if it is invariant under the
action of $\Pi_N(\mathcal{P})$. Equivalently, if $I_S(x)$ is the indicator function of set S, then $I_S(x) = I_S(\pi(x))$ for all $x \in H_N$ and
$\pi \in \Pi_N(\mathcal{P})$.

The number of blocks $k = |\mathcal{P}|$ gives a rough indication of the amount of symmetry present. For example, when
k = 1 we have maximum symmetry and all edges are equivalent. In a stochastic block model (SBM) with $\ell$ classes,
$k = \binom{\ell}{2}$. For a d-dimensional lattice, partitioning the potential edges by distance results in a number of parts that grows with n,
whereas finally if k = N there is no symmetry whatsoever. Our results accommodate partitions with a large number of
parts, far more than enough for most situations. For example, as we saw, in lattices the number of parts equals the number of distinct
distances, while if we have n generic points such that the nearest pair of points have distance 1 while the
farthest have distance D, fixing any $\delta > 0$ and binning together all edges of length in $[(1+\delta)^i, (1+\delta)^{i+1})$ for $i \ge 0$
yields only $O(\delta^{-1} \log D)$ classes.
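The geometric binning just described can be sketched as follows. The sketch assumes points in the plane, Euclidean distance, and distances rescaled so that the nearest pair is at distance 1. The function name `length_classes` and the grid example are hypothetical.

```python
import math
from itertools import combinations

def length_classes(points, delta):
    """Map each pair to the index i with (1+delta)^i <= d / d_min < (1+delta)^(i+1)."""
    dist = {(u, v): math.dist(points[u], points[v])
            for u, v in combinations(range(len(points)), 2)}
    d_min = min(dist.values())  # rescale so the nearest pair has distance 1
    classes = {}
    for pair, d in dist.items():
        i = math.floor(math.log(d / d_min) / math.log1p(delta))
        classes.setdefault(i, []).append(pair)
    return classes

grid = [(x, y) for x in range(5) for y in range(5)]
classes = length_classes(grid, delta=0.5)
print({i: len(pairs) for i, pairs in sorted(classes.items())})
```

The largest index is at most $\log D / \log(1+\delta)$, which matches the $O(\delta^{-1} \log D)$ bound for small $\delta$. Floating-point rounding may shift a pair lying exactly on a bin boundary into the neighbouring class.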

Recall that given a partition $\mathcal{P} = (P_1, \ldots, P_k)$ of [N] and a graph $x \in H_N$, the edge profile of x is
$\mathbf{m}(x) := (m_1(x), \ldots, m_k(x))$, where $m_i(x)$ is the number of edges of x from $P_i$, and that the image of a $\mathcal{P}$-symmetric set S
under $\mathbf{m}$ is denoted as $\mathbf{m}(S) \subseteq \mathbb{R}^k$. The edge-profile is crucial to the study of $\mathcal{P}$-symmetric sets due to the following
basic fact (proven in Section 9).

Proposition 1. Any function $f : H_N \to \mathbb{R}$ invariant under $\Pi_N(\mathcal{P})$ depends only on the edge-profile $\mathbf{m}(x)$.

In particular, since membership in S depends solely on a graph’s edge-profile, it follows that a uniformly random
element of S can be selected as follows: (i) select an edge profile $\mathbf{v} = (v_1, \ldots, v_k) \in \mathbf{m}(S)$ from the distribution on
$\mathbf{m}(S)$ induced by U(S), and then (ii) for each $i \in [k]$ independently select a uniformly random $v_i$-subset of $P_i$.
(This is stated formally as a proposition in Section 9.) Thus, conditional on the edge-profile, not only is the distribution of
edges known, but it factorizes into a product of G(n, m) distributions. In other words, the complexity of the uniform
measure on S manifests entirely in the induced distribution on $\mathbf{m}(S) \subseteq \mathbb{N}^k$, whose structure we need to capture.
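The two-stage selection can be written down directly for small instances. In the sketch below, profiles are enumerated exhaustively, which is only feasible for tiny partitions. The predicate `feasible` (the characteristic function of S expressed on edge profiles), the function names, and the two-part example with a cost budget are all hypothetical.

```python
import math
import random
from itertools import product

def profile_weights(sizes, feasible):
    """Number of graphs with each feasible edge profile v: prod_i C(p_i, v_i)."""
    weights = {}
    for v in product(*(range(p + 1) for p in sizes)):
        if feasible(v):
            weights[v] = math.prod(math.comb(p, vi) for p, vi in zip(sizes, v))
    return weights

def sample_uniform(parts, feasible, rng):
    """Two-stage sampler for a uniform element of a partition-symmetric set S."""
    sizes = [len(part) for part in parts]
    weights = profile_weights(sizes, feasible)
    profiles = list(weights)
    # Stage (i): draw the edge profile with probability proportional to its graph count.
    v = rng.choices(profiles, weights=[weights[u] for u in profiles])[0]
    # Stage (ii): an independent uniform v_i-subset of each part P_i.
    graph = set()
    for part, vi in zip(parts, v):
        graph.update(rng.sample(part, vi))
    return v, graph

# Two parts with per-edge costs 1 and 3 and a total budget of 8 (illustrative values).
parts = [list(range(6)), list(range(6, 10))]
feasible = lambda v: v[0] + 3 * v[1] <= 8
print(sample_uniform(parts, feasible, random.Random(0)))
```

Summing the weights over all feasible profiles gives |S|, which is why stage (i) followed by stage (ii) produces a uniform element of S.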

Definition 1. Let $p_i = |P_i|$ denote the number of edges in part i of partition $\mathcal{P}$.

Edge Profile Entropy. Given an edge profile $\mathbf{v} \in \mathbf{m}(S)$, define the entropy of $\mathbf{v}$ as $\mathrm{Ent}(\mathbf{v}) = \sum_{i=1}^{k} \log \binom{p_i}{v_i}$.

Using the edge-profile entropy we can express the induced distribution on $\mathbf{m}(S)$ as $\mathbb{P}(\mathbf{v}) = \frac{1}{|S|} e^{\mathrm{Ent}(\mathbf{v})}$. The crux
of our argument is now this: the only genuine obstacle to S being approximable by a product measure is degeneracy,
i.e., the existence of multiple, well-separated edge-profiles that maximize $\mathrm{Ent}(\mathbf{v})$. The reason we refer to this as
degeneracy is that it typically encodes a hidden symmetry of S with respect to $\mathcal{P}$. For example, imagine that
$\mathcal{P} = (P_1, P_2)$, where $|P_1| = |P_2| = p$, and that S contains all graphs with p/2 edges from $P_1$ and p/3 edges from $P_2$, or
vice versa. Then, the presence of a single edge $e \in P_1$ in a uniformly random $G \in S$ boosts the probability of all other
edges in $P_1$, rendering a product measure approximation impossible.
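The degeneracy example can be checked by exact counting. The sketch below assumes p is divisible by 6 and uses the fact that the two admissible profiles contain the same number of graphs, so each occurs with probability 1/2. The function name `pair_statistics` and the choice p = 600 are illustrative.

```python
from fractions import Fraction

def pair_statistics(p):
    """Exact edge statistics inside P_1 when the profile is (p/2, p/3) or (p/3, p/2)."""
    a, b = p // 2, p // 3
    # Both profiles contain C(p, a) * C(p, b) graphs, so each has probability 1/2.
    half = Fraction(1, 2)
    # P(e in G) for a fixed edge e of P_1, averaged over the two profiles.
    single = half * Fraction(a, p) + half * Fraction(b, p)
    # P(e in G and e' in G) for two fixed distinct edges e, e' of P_1.
    both = half * Fraction(a * (a - 1), p * (p - 1)) + half * Fraction(b * (b - 1), p * (p - 1))
    # Return the marginal and the conditional probability P(e' in G | e in G).
    return single, both / single

single, conditional = pair_statistics(600)
print(float(single), float(conditional))
```

For p = 600 the conditional probability exceeds the marginal 5/12, whereas under any product measure the two would coincide. For very small p the effect is masked by the negative correlation that fixed edge counts induce within a part.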

Note that since $\mathbf{m}(S)$ is a discrete set, it is non-trivial to quantify what it means for the maximizer of Ent to be
“sufficiently unique”. For example, what happens if there is a unique maximizer of Ent(v) strictly speaking, but 
sufficiently many near-maximizers to potentially receive, in aggregate, a majority of the measure? To strike a balance 
between conceptual clarity and generality we focus on the following. 

Convexity. Let Conv(A) denote the convex hull of a set A. Say that a $\mathcal{P}$-symmetric set $S \subseteq H_N$ is convex iff the
convex hull of $\mathbf{m}(S)$ contains no new integer points, i.e., if $\mathrm{Conv}(\mathbf{m}(S)) \cap \mathbb{N}^k = \mathbf{m}(S)$.

Let $H_{\mathcal{P}}(\mathbf{v})$ be the approximation to $\mathrm{Ent}(\mathbf{v})$ that results by replacing each binomial term with its binary entropy
approximation via the first term in Stirling’s approximation (made precise in the proofs).
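To get a feel for the quality of this approximation, the sketch below compares the exact log-binomial sum with its binary-entropy relaxation. It relies on the standard bound $e^{pH(v/p)}/(p+1) \le \binom{p}{v} \le e^{pH(v/p)}$, with H the binary entropy in nats. The function names are hypothetical and natural logarithms are assumed throughout.

```python
import math

def log_binom(p, v):
    """log C(p, v) computed through log-gamma to avoid huge integers."""
    return math.lgamma(p + 1) - math.lgamma(v + 1) - math.lgamma(p - v + 1)

def binary_entropy_term(p, v):
    """p * H(v / p) in nats, with the convention 0 log 0 = 0."""
    if v == 0 or v == p:
        return 0.0
    q = v / p
    return -p * (q * math.log(q) + (1 - q) * math.log(1 - q))

def profile_entropy(sizes, v):
    """Exact edge-profile entropy: sum_i log C(p_i, v_i)."""
    return sum(log_binom(p, vi) for p, vi in zip(sizes, v))

def relaxed_entropy(sizes, v):
    """Binary-entropy relaxation: sum_i p_i * H(v_i / p_i)."""
    return sum(binary_entropy_term(p, vi) for p, vi in zip(sizes, v))

sizes, v = [1000, 500], [300, 250]
gap = relaxed_entropy(sizes, v) - profile_entropy(sizes, v)
print(gap, sum(math.log(p + 1) for p in sizes))
```

The gap is non-negative and at most $\sum_i \log(p_i + 1)$, i.e., logarithmic in the part sizes per part.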

Entropic Optimizer. Let $\mathbf{m}^* = \mathbf{m}^*(S) \in \mathbb{R}^k$ be the solution to $\max_{\mathbf{v} \in \mathrm{Conv}(\mathbf{m}(S))} H_{\mathcal{P}}(\mathbf{v})$.

Defining the optimization over the convex hull of $\mathbf{m}(S)$ is crucial, as it will allow us to study the set S by studying
only the properties of the maximizer $\mathbf{m}^*$. Clearly, if a $\mathcal{P}$-symmetric set S has entropic optimizer $\mathbf{m}^* = (m_1^*, \ldots, m_k^*)$,
the natural candidate product measure for each $i \in [k]$ assigns probability $m_i^*/p_i$ to all edges in part $P_i$. The challenge
is to relate this product measure to the uniform measure on S by proving concentration of the induced measure on
$\mathbf{m}(S)$ around a point near $\mathbf{m}^*$. For that we need (i) the vector $\mathbf{m}^*$ to be “close” to a vector in $\mathbf{m}(S)$¹, and (ii) to
control the decrease in entropy “away” from $\mathbf{m}^*$. To quantify this second notion we need the following parameters,
expressing the geometry of convex sets.

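Definition 2. Given a partition $\mathcal{P}$ and a $\mathcal{P}$-symmetric convex set S, we define

$$\text{Thickness:} \quad \mu = \mu(S) = \min_{i \in [k]} \min\{m_i^*,\ p_i - m_i^*\} \tag{2}$$

$$\text{Condition number:} \quad \lambda = \lambda(S) = \frac{5k \log n}{\mu(S)} \tag{3}$$

$$\text{Resolution:} \quad r = r(S) = \frac{\lambda + \sqrt{\lambda^2 + 4\lambda}}{2} \tag{4}$$
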

The most important parameter is the thickness $\mu(S)$. It quantifies the minimal coordinate-distance of the optimizer
$\mathbf{m}^*(S)$ from the natural boundary $\{0, p_1\} \times \ldots \times \{0, p_k\}$ where the entropy of a class becomes zero. Thus, this
parameter determines the rate of coordinate-wise concentration around the optimum.

The condition number $\lambda(S)$, on the other hand, quantifies the robustness of S. To provide intuition, in order for
the product measure approximation to be accurate for every class of edges (part of $\mathcal{P}$), fluctuations in the number of
edges of order $\sqrt{m_i^*}$ need to be “absorbed” in the mean $m_i^*$. For this to happen with polynomially high probability
for a single part, standard results imply we must have $m_i^* = \Omega(\log n)$. We absorb the dependencies between parts
by taking a union bound, thus multiplying by the number of parts, yielding the numerator in (3). Our results give
strong probability bounds when $\lambda(S) \ll 1$, i.e., when in a typical graph in S the number of edges from each part is
$\Omega(k \log n)$ edges away from triviality, a condition we expect to hold in all natural applications. We can now state our
main result.


Theorem 2 (Main result). Let $\mathcal{P}$ be any edge-partition and let S be any $\mathcal{P}$-symmetric convex set. For every
$\epsilon > \sqrt{12\lambda(S)}$, the uniform measure over S is $(\epsilon, \delta)$-sandwichable, where $\delta = 2\exp\left[-\mu(S)\left(\frac{\epsilon^2}{12} - \lambda(S)\right)\right]$.

Remark 1. As a sanity check we see that as soon as $m \gg \log n$, Theorem 2 recovers the sandwichability of G(n, m)
by G(n, p(m)) as sharply as the Chernoff bound, up to the constant factor in the exponent.

Theorem 2 follows by analyzing the natural coupling between the uniform measure on S and the product measure
corresponding to the entropic optimizer $\mathbf{m}^*$. Our main technical contribution is Theorem 3 below, a concentration
inequality for $\mathbf{m}(S)$ when S is a convex symmetric set. The resolution, r(S), defined in (4) above, reflects the
smallest concentration width that can be proved by our theorem. When $\lambda(S) \ll 1$, as required for the theorem to be
meaningfully applied, it scales optimally as $\sqrt{\lambda(S)}$.

¹ Indeed, this is the only use we make of convexity in the proof presented here.

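Theorem 3. Let $\mathcal{P}$ be any edge-partition, let S be any $\mathcal{P}$-symmetric convex set, and let $\mathbf{m}^*$ be the entropic optimizer
of S. For all $\epsilon > r(S)$, if $G \sim U(S)$, then

$$\mathbb{P}_S\left(|\mathbf{m}(G) - \mathbf{m}^*| \le \epsilon \tilde{\mathbf{m}}\right) \ge 1 - \exp\left(-\mu(S)\left[\frac{\epsilon^2}{1+\epsilon} - \lambda(S)\right]\right), \tag{5}$$

where $\mathbf{x} \le \mathbf{y}$ means that $x_i \le y_i$ for all $i \in [k]$, and $\tilde{m}_i = \min\{m_i^*, p_i - m_i^*\}$.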

The intuition driving concentration is that as thickness increases two phenomena occur: (i) vectors close to $\mathbf{m}^*$
capture a larger fraction of the measure, and (ii) the decay in entropy away from $\mathbf{m}^*$ becomes steeper. These joint
forces compete against the probability mass captured by vectors “away” from the optimum. The point where they
prevail corresponds to $\lambda(S) \ll 1$ or, equivalently, $\mu(S) \gg 5k \log n$. Assuming $\lambda(S) \ll 1$, the probability bounds we
give scale as $n^{-\Omega(k)}$. Without assumptions on S, and up to the constant 5 in (3), this is sharp.
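The parameters of Definition 2 and the two probability bounds are easy to evaluate numerically. The sketch below assumes the forms $\mu = \min_i \min\{m_i^*, p_i - m_i^*\}$, $\lambda = 5k \log n / \mu$, $r = (\lambda + \sqrt{\lambda^2 + 4\lambda})/2$, $\delta = 2\exp[-\mu(\epsilon^2/12 - \lambda)]$ for Theorem 2 and $1 - \exp[-\mu(\epsilon^2/(1+\epsilon) - \lambda)]$ for Theorem 3, with natural logarithms. Function names and the G(n, m) example values are hypothetical.

```python
import math

def thickness(m_star, sizes):
    """Minimal coordinate-distance of the optimizer from the boundary {0, p_i}."""
    return min(min(m, p - m) for m, p in zip(m_star, sizes))

def condition_number(m_star, sizes, n):
    """lambda = 5 k log(n) / mu, with k the number of parts."""
    return 5 * len(sizes) * math.log(n) / thickness(m_star, sizes)

def resolution(lam):
    """Positive root of eps^2 / (1 + eps) = lam."""
    return (lam + math.sqrt(lam * lam + 4 * lam)) / 2

def sandwich_delta(mu, lam, eps):
    """Failure probability in Theorem 2; meaningful when eps > sqrt(12 * lam)."""
    return 2 * math.exp(-mu * (eps ** 2 / 12 - lam))

def concentration_bound(mu, lam, eps):
    """Lower bound in Theorem 3; meaningful when eps > resolution(lam)."""
    return 1 - math.exp(-mu * (eps ** 2 / (1 + eps) - lam))

# G(n, m) as a single-part example: k = 1, p_1 = C(n, 2), m* = m.
n, m = 10**4, 10**6
sizes, m_star = [math.comb(n, 2)], [m]
mu = thickness(m_star, sizes)
lam = condition_number(m_star, sizes, n)
print(mu, lam, resolution(lam), math.sqrt(lam))
print(sandwich_delta(mu, lam, eps=0.05), concentration_bound(mu, lam, eps=0.05))
```

The resolution is exactly the threshold at which the exponent in (5) becomes positive, which is why it is close to $\sqrt{\lambda}$ when $\lambda$ is small.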

The concentration and sandwich theorems greatly facilitate the study of monotone properties under the uniform
measure over convex symmetric sets. Going beyond monotone properties, we would like to enable the study of more
local events (involving a few edges). For example, we would like to be able to make statements about moments of
subgraph counts (triangles, cliques, cycles) and other local graph functions.


5 Technical Overview 

In this section, we present an overview of the technical work involved in the proof of our theorems. Most of the work 
lies in the concentration result, Theorem 3.

Concentration. The general idea is to identify a high-probability subset $C \subseteq \mathbf{m}(S)$ by integrating the probability
measure around the entropy-maximizing profile $\mathbf{m}^*$. Since ultimately our goal is to couple the uniform measure with
a product measure, we need to establish concentration for every part of the edge-partition, i.e., in every coordinate.
There are two main obstacles to overcome: (i) we do not know |S|, and (ii) we must integrate the measure outside C
while concurrently quantifying the decrease in entropy as a function of the $\ell_\infty$ distance from the maximizer $\mathbf{m}^*$. Our
strategy to resolve the above obstacles is:

Size of S. We bound log |S| from below by the contribution to log |S| of the entropic optimal edge-profile $\mathbf{m}^*$,
thus upper-bounding the probability of every $\mathbf{v} \in \mathbf{m}(S)$ as

$$\log \mathbb{P}_S(\mathbf{v}) = \mathrm{Ent}(\mathbf{v}) - \log |S| \le \mathrm{Ent}(\mathbf{v}) - \mathrm{Ent}(\mathbf{m}^*) . \tag{6}$$

This is the crucial step that opens up the opportunity of relating the probability of a vector $\mathbf{v}$ to the distance $\|\mathbf{v} - \mathbf{m}^*\|_2$
through analytic properties of entropy. The importance of this step stems from the fact that all information about S
resides in $\mathbf{m}^*$ due to our choice of solving the optimization problem on the convex hull.

Proposition 2. There exists a partition $\mathcal{P}$ with k parts and a convex $\mathcal{P}$-symmetric set $S \subseteq \mathcal{G}_n$ such that
$\log |S| - \mathrm{Ent}(\mathbf{m}^*) = \Omega(k \log n)$.

Proposition 2 demonstrates that unless one utilizes specific geometric properties of the set S enabling integration
around $\mathbf{m}^*$, instead of using a point-bound for log |S|, a loss of $\Omega(k \log n)$ is unavoidable. In other words, either one
makes more assumptions on S besides symmetry and “convexity”, or this error term is optimal as claimed.

Distance bounds: To bound from below the rate at which entropy decays as a function of the component-wise
distance from the maximizer $\mathbf{m}^*$, we first approximate $\mathrm{Ent}(\mathbf{v})$ by the corresponding binary entropy to get a smooth
function. Exploiting the separability, concavity and differentiability of entropy we obtain component-wise distance
bounds using a second-order Taylor approximation. At this step we also lose a cumulative factor of order 3k log n
stemming from Stirling approximations and the subtle point that the maximizer $\mathbf{m}^*$ might not be an integer point. The
constant 3 can be improved, but in light of Proposition 2 this would be pointless and complicate the proof unnecessarily.

Union bound: Finally, we integrate the obtained bounds outside the set of interest by showing that even if all “bad”
vectors were placed right at the boundary of the set, where the lower bound on the decay of entropy is smallest, the
total probability mass would be exponentially small. The loss incurred at this step is of order 2k log n, since there are
at most $n^{2k}$ bad vectors.

Relaxing Conclusions. Our theorem seeks to provide concentration simultaneously for all parts. That motivates the
definition of the thickness parameter $\mu(S)$ as the minimum distance from the natural boundary of the number of edges that
any part i has at the optimum $\mathbf{m}^*$. Quantifying everything in terms of $\mu(S)$ is a very conservative requirement. For
instance, if we define the set S to have no edges in a particular part of the partition, then $\mu(S)$ is 0 and our conclusions
become vacuous. Our proofs in fact generalize to the case where we confine our attention to a subset $I \subseteq [k]$
of blocks in the partition. In particular, if one defines $I^*$ as the set of parts whose individual thickness parameter
$\tilde{m}_i = \min\{m_i^*, p_i - m_i^*\}$ is greater than 5k log n, both theorems hold for the subset of edges $\bigcup_{i \in I^*} P_i$. In essence,
this means that for every class that is “well-conditioned”, we can provide concentration of the number of edges and
approximate monotone properties of only those parts by coupling them with product measures.

Relaxing Convexity. Besides partition symmetry, which comprises our main premise and starting point, the second
main assumption made about the structure of S is convexity. In the proof (see the relevant lemma in Section 7) convexity is used
only to argue that (i) the maximizer $\mathbf{m}^*$ will be close to some vector in $\mathbf{m}(S)$, and (ii) the first-order term in the
Taylor approximation of the entropy is always negative. However, since the optimization problem was defined on the
convex hull of $\mathbf{m}(S)$, in point (ii) above we are only using convexity of $\mathrm{Conv}(\mathbf{m}(S))$ and not of the set S. Thus, the
essential requirement on $\mathcal{P}$-symmetric sets is approximate unimodality.

Definition 3. A $\mathcal{P}$-symmetric set $S$ is called $\Delta$-unimodal if the solution $m^*$ to the entropy optimization problem 
defined in Section 2 satisfies: 

\[ d_1(m^*, S) := \min_{v \in m(S)} \|m^* - v\|_1 \le \Delta \tag{7} \]

Convexity essentially implies that the set $S$ is $k$-unimodal, as we need to round each of the $k$ coordinates of the 
solution to the optimization problem to the nearest integer. Under this assumption, all our results apply by only 
changing the condition number of the set to $\lambda(S) = \frac{(2\Delta + 3k) \log n}{\mu(S)}$. In this extended abstract, we opted to present our 
results by using the familiar notion of convexity to convey intuition and postpone the presentation in full 
generality to the full version of the paper. 

Coupling. To prove Theorem 2 using our concentration result, we argue as follows. Conditional on the edge-profile, 
we can couple the generation of edges in different parts independently by a similar process as in the G(n, m) to 
G(n, p) case. Then, using a union bound we can bound the probability that all couplings succeed given an appropriate 
v. Finally, using the concentration theorem we show that sampling an appropriate edge-profile happens with high 
probability. 

6 Discussion 

In studying the uniform measure over symmetric sets of graphs our motivation was twofold. On one hand was the 
desire to explore the extent to which symmetric sets (properties) can be studied via product measure approximations 
(instead of assuming product form and studying properties of the resulting ad-hoc random graph model). On the other 
hand, we wanted to provide a framework of random graph models that arise by constraining only the support of the 
probability distribution and not making specific postulations about the finer probabilistic structure, e.g., at the level of 
edges. An unanticipated result of our work is a connection to Machine Learning (ML). 


6.1 Connection to Machine Learning 

In Machine Learning, a probabilistic model for the data is assumed and exploited to perform some inferential task (e.g. 
classification, clustering, estimation). Logistic Regression is a widely used modeling, estimation and classification 
tool. In the simplest setting, there are $N$ binary outcomes $Y_1, \ldots, Y_N$ and each $Y_i \in \{0,1\}$ is associated with a feature 
vector $X_i \in \mathbb{R}^d$. A statistical model is constructed by assuming that there exists an (unknown) vector $\beta \in \mathbb{R}^d$ such 
that: (i) the $N$ outcomes are mutually independent given $\beta$, and (ii) the marginal probability of an outcome is given 
by $P_\beta(Y_i = 1) = [1 + \exp(X_i^\top \beta)]^{-1}$. Thus, given the outcomes and features, one seeks to find $\beta$ by Maximum 
Likelihood Estimation (MLE). 

In the context of network modeling, the $N = \binom{n}{2}$ binary outcomes correspond to edges and the $N$ feature 
vectors can be either known or latent. In both the Classification and Network Modeling settings, it is natural 
to wonder whether both assumptions (independence and parametric form of the likelihood) are realistic and not 
far-fetched idealizations. Our work provides, to the best of our knowledge, the first justification for both assumptions, 
when the features (covariates) take only $k$ distinct values. In fact, let $X_\ell$ for $\ell \in [k]$ denote the common 
feature vector of outcomes from part $\ell$ and let $X = [X_1 \ldots X_k]$. Our results say that logistic regression is the product 
measure approximation of the uniform measure on the set $S = S(X, c(\beta)) = \{x \in \{0,1\}^N : X \cdot m(x) \le c\}$ and 
that the logistic approximation, amounting to assumptions (i)-(ii) above, is valid whenever $m(x)$ is in each coordinate 
$\Omega(k \log n)$ far from triviality, i.e. there are enough positive and negative labels from each part. An in-depth exploration 
of such connections, i.e., between widely used techniques in Machine Learning and the uniform measure over sets, is 
a promising direction that we pursue in future work. 


7 Proof of Theorem 2 


In this section we prove Theorem 2. For the purposes of the proof we are going to employ, instead of $m(x)$, a different 
parametrization in terms of the probability profile $a(x) = (a_1(x), \ldots, a_k(x)) \in [0,1]^k$ where $a_i(x) = m_i(x)/p_i$. This will 
be convenient both in calculations and conceptually, as $a_i(x)$ represents the effective edge density of a part $P_i$ in 
the partition. We start by approximating the entropy of an edge-profile via the $\mathcal{P}$-entropy. 

Definition 4. Given a partition $\mathcal{P}$, the $\mathcal{P}$-entropy is defined for every $a \in [0,1]^k$ as 

\[ H_{\mathcal{P}}(a) = -\sum_{i=1}^{k} p_i \left[ a_i \log a_i + (1 - a_i) \log(1 - a_i) \right] \tag{8} \]

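The $\mathcal{P}$-entropy is simply the entropy of the product measure defined over edges through $a$. We slightly abuse the 
notation and also define the $\mathcal{P}$-entropy in terms of the edge-profile: 

\[ H_{\mathcal{P}}(v) = -\sum_{i=1}^{k} \left[ v_i \log\frac{v_i}{p_i} + (p_i - v_i) \log\frac{p_i - v_i}{p_i} \right] \tag{9} \]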


Let $\mathcal{M}_{\mathcal{P}} := \{0, \ldots, p_1\} \times \ldots \times \{0, \ldots, p_k\}$ be the space of all possible vectors $m$. In what follows we sometimes 
suppress the dependence of the quantities on $m$ or $a$ to ease the notation. 

Lemma 1. Let $m \in \mathcal{M}_{\mathcal{P}}$ be an edge-profile and $a \in [0,1]^k$ be the corresponding probability profile, then: 

\[ \mathrm{ENT}(m) = H_{\mathcal{P}}(a) - \gamma(n) \]

where $0 \le \gamma(n) \le k \log n$ is a term that approaches zero as $m_i$ and $p_i - m_i$ tend to infinity. 

Proof. We begin by providing the first order Stirling approximation for a single term of the form $\log\binom{p_i}{m_i}$. Specifically, 
since $m_i = p_i a_i$ and by using $\log n! = n \log n - n + \frac{1}{2} \log n + \theta_n$, where $\theta_n \in (0,1]$, we get: 

\[ \log\binom{p_i}{m_i} = \log(p_i!) - \log(m_i!) - \log((p_i - m_i)!) = -p_i \left[ a_i \log a_i + (1 - a_i) \log(1 - a_i) \right] - \delta_n(a_i, p_i) , \]

where $0 \le \delta_n(a_i, p_i) \le \log n$. Summing the derived expression for all $i \in [k]$ gives: 

\[ \mathrm{ENT}(m) = -\sum_{i=1}^{k} p_i \left[ a_i \log a_i + (1 - a_i) \log(1 - a_i) \right] - \sum_{i=1}^{k} \delta_n(a_i, p_i) = H_{\mathcal{P}}(a) - \gamma(n) , \]

where $0 \le \gamma(n) \le k \log n$. □ 

[Footnote to Section 6.1] For some $c \in \mathbb{R}^d$ such that $\beta$ is the Lagrangian multiplier of an entropy optimization problem over $S$. 
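The statement of Lemma 1 can be checked numerically on a small instance. The sketch below is illustrative only: it assumes that ENT(m) is the sum of log binomial coefficients over the parts (as the summation step of the proof suggests), and the part sizes and edge counts are hypothetical values chosen for the example.

```python
import math

def log_binom(p, m):
    """Exact log of the binomial coefficient C(p, m), via log-gamma."""
    return math.lgamma(p + 1) - math.lgamma(m + 1) - math.lgamma(p - m + 1)

def ent(m, p):
    """ENT(m), assumed here to be sum_i log C(p_i, m_i)."""
    return sum(log_binom(pi, mi) for mi, pi in zip(m, p))

def p_entropy(m, p):
    """P-entropy of the probability profile a_i = m_i / p_i, as in (8)."""
    total = 0.0
    for mi, pi in zip(m, p):
        a = mi / pi
        for q in (a, 1.0 - a):
            if q > 0.0:  # convention: 0 * log 0 = 0
                total -= pi * q * math.log(q)
    return total

def stirling_gap(m, p):
    """gamma(n) = H_P(a) - ENT(m), the error term of Lemma 1."""
    return p_entropy(m, p) - ent(m, p)

# Hypothetical instance: n = 50 vertices, C(50, 2) = 1225 potential edges
# split into k = 3 parts, with an arbitrary edge-profile m.
n = 50
p = [300, 400, 525]
m = [30, 200, 500]
gamma = stirling_gap(m, p)
print(f"gamma = {gamma:.3f}, k log n = {len(p) * math.log(n):.3f}")
```

On this instance the gap lies between 0 and k log n, in line with the lemma.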

Next, using the Taylor remainder theorem and the connection with $\mathcal{P}$-entropy, we obtain geometric estimates on 
the decay of entropy around $m^*$. 

Theorem 4 (Taylor Remainder Theorem). Assume that $f$ and all its partial derivatives are differentiable at every 
point of an open set $S \subseteq \mathbb{R}^k$. If $a, b \in S$ are such that the line segment $L(a, b) \subseteq S$, then there exists a point 
$z \in L(a, b)$ such that: 

\[ f(b) - f(a) = \nabla f(a)^\top (b - a) + \frac{1}{2} (b - a)^\top \nabla^2 f(z) (b - a) . \tag{10} \]


Lemma 2 ($L_2$ distance bounds). If $m^*$ is the unique maximizer and $w \in \mathrm{Conv}(m(S))$, then 

\[ \mathrm{ENT}(w) - \mathrm{ENT}(m^*) \le -\sum_{i=1}^{k} \frac{(w_i - m_i^*)^2}{\max\{\tilde{m}_i^*, \tilde{w}_i\}} + 3k \log n , \tag{11} \]

where $\tilde{m}_i^* = \min\{m_i^*, p_i - m_i^*\}$ (respectively $\tilde{w}_i$) denotes the thickness of a part $i \in [k]$. 


Proof. Invoking Lemma 1, we rewrite the difference in entropy as a difference in $\mathcal{P}$-entropy, where $a^*$ is the 
probability profile of the maximizer and $b$ that of $w$: 

\[ \mathrm{ENT}(w) - \mathrm{ENT}(m^*) \le H_{\mathcal{P}}(b) - H_{\mathcal{P}}(a^*) + 3k \log n \]

Here, we have additionally dealt with the subtle integrality issue, namely that $m^*$ might not belong to $m(S)$. Rounding 
the vector to the nearest integral point produces a cumulative error of at most $2k \log n$ in the entropy that adds to 
the $k \log n$ error coming from Stirling's approximation. Both errors can be reduced using higher order Stirling 
approximations but we avoid doing so since an error of order $k \log n$ is unavoidable due to the approximation of the 
partition function. 

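Convexity of the domain $\mathrm{Conv}(m(S))$ and differentiability of the $\mathcal{P}$-entropy provide the necessary conditions to 
use the Taylor Remainder Theorem. Let $z$ be a point in the line segment $L(a^*, b)$. We proceed with writing the 
expressions for partial derivatives of $H_{\mathcal{P}}$. 

\[ \partial_i H_{\mathcal{P}}(a^*) = -p_i \log\frac{a_i^*}{1 - a_i^*} \tag{12} \]

\[ \partial^2_{ii} H_{\mathcal{P}}(z) = -p_i \left( \frac{1}{1 - z_i} + \frac{1}{z_i} \right) , \tag{13} \]

while $\partial^2_{ij} H_{\mathcal{P}} = 0$ for $i \ne j$ due to separability of the function $H_{\mathcal{P}}$. The Taylor Remainder formula now reads: 

\[ H_{\mathcal{P}}(b) - H_{\mathcal{P}}(a^*) = \nabla H_{\mathcal{P}}(a^*) \cdot (b - a^*) - \frac{1}{2} \sum_{i=1}^{k} p_i (b_i - a_i^*)^2 \left( \frac{1}{1 - z_i} + \frac{1}{z_i} \right) \tag{14} \]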

Since $a^*$ is the unique solution to the MaxEnt problem and the domain is convex, the first term in the above formula 
is always bounded above by zero. Otherwise, there would be a direction $u$ and a small enough parameter $\epsilon > 0$ such 
that $a^* + \epsilon u$ has greater entropy, a contradiction. To bound the second sum from above, let $\tilde{z}_i = \min\{z_i, 1 - z_i\}$ 
(expressing the fact that binary entropy is symmetric around 1/2) and use the trivial bound $\tilde{z}_i \le \max\{\tilde{a}_i^*, \tilde{b}_i\}$. 
Thus, 

\[ H_{\mathcal{P}}(b) - H_{\mathcal{P}}(a^*) \le -\sum_{i=1}^{k} p_i \frac{(b_i - a_i^*)^2}{\max\{\tilde{a}_i^*, \tilde{b}_i\}} . \tag{15} \]

Dividing and multiplying by $p_i$, and writing $\tilde{w}_i = p_i \tilde{b}_i$, $\tilde{m}_i^* = p_i \tilde{a}_i^*$, gives: 

\[ H_{\mathcal{P}}(w) - H_{\mathcal{P}}(m^*) \le -\sum_{i=1}^{k} \frac{(w_i - m_i^*)^2}{\max\{\tilde{m}_i^*, \tilde{w}_i\}} , \tag{16} \]

where $w$ and $m^*$ are the original edge profiles. We note that for most cases we have that $\tilde{z}_i = z_i$, i.e. a block is 
at most half full. □ 

In preparation for performing the “union bound”, we prove that: 

Proposition 3. The number of distinct edge-profiles $|m(S)|$ is bounded by $|\mathcal{M}_{\mathcal{P}}| \le n^{2k}$. 

Proof. Assuming that no constraint is placed upon $m$ by $S$, then $m(S) = \mathcal{M}_{\mathcal{P}}$. This number is equal to the product 
of $[p_i + 1] \le n^2$, as there are at most $\binom{n}{2}$ edges within a block. Multiplying the last bound over the $k$ parts we get the statement. □ 
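The counting argument of Proposition 3 can be illustrated by brute-force enumeration on a toy partition. This is only a sketch; the values of `n` and `part_sizes` are hypothetical and chosen so that the enumeration is tiny.

```python
import math
from itertools import product

def count_edge_profiles(part_sizes):
    """Enumerate all vectors m with 0 <= m_i <= p_i and count them."""
    return sum(1 for _ in product(*(range(p + 1) for p in part_sizes)))

# Hypothetical instance: n = 5 vertices, C(5, 2) = 10 potential edges,
# split into k = 2 parts of sizes 4 and 6.
n = 5
part_sizes = [4, 6]
k = len(part_sizes)

num_profiles = count_edge_profiles(part_sizes)
upper_bound = n ** (2 * k)
print(num_profiles, upper_bound)
```

The enumeration agrees with the product of the $(p_i + 1)$ terms and stays below $n^{2k}$.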

Before proceeding with the proof of the concentration theorem, we repeat the definitions of the crucial parameters 
mentioned in the introduction. 


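Definition 5. Given a partition $\mathcal{P}$ and a $\mathcal{P}$-symmetric set $S$, define: 

\[ \mu(S) = \min_{i \in [k]} \{ m_i^* \wedge (p_i - m_i^*) \} \tag{17} \]

\[ \lambda(S) = \frac{5k \log n}{\mu(S)} \tag{18} \]

\[ r(S) = \frac{\lambda + \sqrt{\lambda^2 + 4\lambda}}{2} \tag{19} \]

the thickness, condition number and resolution of the convex set $S$, where $x \wedge y := \min\{x, y\}$ denotes the min operation. 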
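To make the three parameters concrete, the following sketch computes them for a hypothetical optimal edge-profile and confirms numerically that the resolution is the scale at which $\epsilon^2/(1+\epsilon)$ equals the condition number. The instance (`n`, `p`, `m_star`) is an assumption made up for the illustration, not taken from the paper.

```python
import math

def thickness(m_star, p):
    """mu(S): smallest distance of any m*_i from the boundary {0, p_i}."""
    return min(min(mi, pi - mi) for mi, pi in zip(m_star, p))

def condition_number(m_star, p, n):
    """lambda(S) = 5 k log n / mu(S)."""
    return 5 * len(p) * math.log(n) / thickness(m_star, p)

def resolution(lam):
    """r(S) = (lambda + sqrt(lambda^2 + 4 lambda)) / 2."""
    return (lam + math.sqrt(lam * lam + 4 * lam)) / 2

def tail_exponent(eps, mu, lam):
    """Exponent -mu * (eps^2 / (1 + eps) - lambda) of the concentration bound."""
    return -mu * (eps * eps / (1 + eps) - lam)

# Hypothetical instance: n = 1000, two parts covering C(1000, 2) = 499500 edges.
n = 1000
p = [250000, 249500]
m_star = [100000, 50000]

mu = thickness(m_star, p)
lam = condition_number(m_star, p, n)
r = resolution(lam)
print(f"mu = {mu}, lambda = {lam:.5f}, r = {r:.5f}")
```

Since $r$ is the positive root of $\epsilon^2 - \lambda\epsilon - \lambda = 0$, the exponent vanishes at $\epsilon = r$ and is negative for any larger scale.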

Proof of Theorem 2. Our goal is to use the developed machinery to control the probability of deviations from the 
optimum at scale $\epsilon > r(S)$. Define the set $\mathcal{L}_\epsilon(m^*) = \{x \in S : |m(x) - m^*| \le \epsilon \tilde{m}^*\}$, where the inequality holds coordinate-wise. We are going to show 
that $P_S(\mathcal{L}^c_\epsilon(m^*)) \to 0$ “exponentially” fast and thus provide localization of the edge profile within a scale $\epsilon$ for each 
coordinate. To that end, we write: 

\[ P_S(\mathcal{L}^c_\epsilon(m^*)) = \sum_{w \in \mathcal{L}^c_\epsilon(m^*)} P_S(w) \le \sum_{w \in \mathcal{L}^c_\epsilon(m^*)} \exp\left[ \mathrm{ENT}(w) - \mathrm{ENT}(m^*) \right] \tag{20} \]

where we have used (6), the approximation of the log-partition function by the entropy of the optimum edge-profile, 
and added the contribution of points outside of $m(S)$. At this point, we are going to leverage the lower bound for the 
decay of entropy away from $m^*$. This is done by first performing a union bound, i.e. considering that all points in 
$\mathcal{L}^c_\epsilon(m^*)$ are placed on the least favorable such point $w^*$. Since we are requiring coordinate-wise concentration, such a 
point would differ from the optimal vector only in one coordinate, and in particular should be the one that minimizes 
our lower bound. Any such vector $w \in \mathcal{L}^c_\epsilon(m^*)$ would have at least one coordinate $i \in [k]$ such that $|w_i - m_i^*| = \epsilon \tilde{m}_i^*$. 
By Lemma 2 we get 

\[ \mathrm{ENT}(w) - \mathrm{ENT}(m^*) \le -\frac{\epsilon^2 (\tilde{m}_i^*)^2}{(1 + \epsilon) \tilde{m}_i^*} + 3k \log n = -\frac{\epsilon^2}{1 + \epsilon} \tilde{m}_i^* + 3k \log n \tag{21} \]
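using the facts that $\max\{\tilde{m}_i^*, \tilde{w}_i\} \le \tilde{m}_i^* + |w_i - m_i^*| \le (1 + \epsilon) \tilde{m}_i^*$. Now, by definition the thickness $\mu(S) \le \tilde{m}_i^*$ for all 
$i \in [k]$, and so a vector $w^*$ that minimizes the bound is such that $\mathrm{ENT}(w^*) - \mathrm{ENT}(m^*) \le -\frac{\epsilon^2}{1 + \epsilon} \mu(S) + 3k \log n$. 

We perform the union bound by using $|\mathcal{L}^c_\epsilon(m^*)| \le |\mathcal{M}_{\mathcal{P}}| \le \exp(2k \log n)$ from Proposition 3: 

\[ P_S(\mathcal{L}^c_\epsilon(m^*)) \le |\mathcal{L}^c_\epsilon(m^*)| \cdot P_S(w^*) \tag{22} \]

\[ \le \exp\left[ -\frac{\epsilon^2}{1 + \epsilon} \mu(S) + 5k \log n \right] \tag{23} \]

\[ \le \exp\left[ -\mu(S) \left( \frac{\epsilon^2}{1 + \epsilon} - \frac{5k \log n}{\mu(S)} \right) \right] \tag{24} \]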

Finally, identifying $\lambda(S)$ in the expression provides the statement. We note here that the resolution $r(S)$ is defined 
exactly so that the expression in the exponent is negative. The condition $\lambda(S) \ll 1$ is a requirement that makes 
concentration possible at a small scale, i.e. $\epsilon \ll 1$. □ 


Tightness. The crucial steps in the proof are, firstly, the approximation of the log-partition function and, secondly, 
the $L_2$ distance bounds on the decay of entropy away from the optimum. Both steps are essentially optimal under 
general assumptions, as is shown in Proposition 2. Our proof can only be improved by using higher order Stirling 
approximations and a more complicated integration process (incremental union bounds over $L_\infty$-shells) instead of the 
simple union bound, to reduce the error from $5k \log n$ down to possibly the minimum of $2k \log n$. Since the above 
considerations would complicate the proof significantly and the gain is a small improvement in the constant, we deem 
this unnecessary. 


8 Proof of Theorem 1 

In this section, we leverage the concentration theorem to prove that convex $\mathcal{P}$-symmetric sets are $(\epsilon, \delta)$-sandwichable. 
Before presenting the proof of the theorem we state a preliminary lemma. 

8.1 Basic Coupling Lemma 

Consider a set of random variables $X_1, \ldots, X_k$ with laws $\mu_1, \ldots, \mu_k$. A coupling between a set of random variables is 
a (joint) probability distribution $\mu$ such that the law of $X_i$ under $\mu$ is $\mu_i$ for $i \in [k]$, i.e. the marginals of the random variables 
are right. Let $A = \{1, \ldots, N\}$ be a finite set with $N$ elements. Further let $X$ denote a uniform subset of $m$ elements of 
$A$, denoted as $X \sim \mathrm{Samp}(m, A)$, and $Z$ a subset of $A$ where each element of $A$ is included independently with the same probability 
$p$, denoted as $Z \sim \mathrm{Flip}(p, A)$. 

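Lemma 3. Given a set $A$ with $N$ elements, a number $m$ and $\delta \in (0,1)$, define $p_\pm(m) = \frac{m}{(1 \mp \delta) N}$. Consider the random variables 
$X \sim \mathrm{Samp}(m, A)$ and $Z^\pm \sim \mathrm{Flip}(p_\pm, A)$; then there exists a coupling $\mu$ such that: 

\[ P_\mu(Z^- \subseteq X \subseteq Z^+) \ge 1 - 2 \exp\left( -\frac{\delta^2}{3(1 + \delta)} m \right) \tag{25} \]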

Proof. Let $\mu$ be the joint distribution of $U_1, \ldots, U_N$, i.i.d. uniform random variables in $[0,1]$, and let $U_{(m)}$ denote the 
$m$-th smallest such random variable. Define $X(U) = \{i \in A : U_i \le U_{(m)}\}$ and $Z^\pm(U) = \{i \in A : U_i \le p_\pm\}$ 
to be random subsets of $A$. By construction it is easy to see that $X(U)$ and $Z^\pm(U)$ have the right marginals. By 
construction of the sets, it is also easy to see that the following equivalence holds: 

\[ Z^- \subseteq X \subseteq Z^+ \iff |Z^-| \le |X| \le |Z^+| \]

To analyze the second event define the “bad” events: 

\[ B_- = \left\{ u \in [0,1]^N : \sum_{i=1}^{N} \mathbb{1}(u_i \le p_-) > m \right\} \]

\[ B_+ = \left\{ u \in [0,1]^N : \sum_{i=1}^{N} \mathbb{1}(u_i \le p_+) < m \right\} \]

Each event states that the sum $X_\pm$ of $N$ i.i.d. Bernoulli($p_\pm$) random variables exceeds (respectively, falls below) 
a multiple of its expectation $N p_\pm$. By employing standard Chernoff bounds, we get: 

\[ P_\mu(B_-) = P_\mu(X_- > m) = P_\mu(X_- > (1 + \delta) N p_-) \le \exp\left( -\frac{\delta^2}{3(1 + \delta)} m \right) \]

\[ P_\mu(B_+) = P_\mu(X_+ < m) = P_\mu(X_+ < (1 - \delta) N p_+) \le \exp\left( -\frac{\delta^2}{2(1 - \delta)} m \right) \]

The proof is concluded through the use of the union bound: 

\[ P_\mu(B_- \cup B_+) \le P_\mu(B_-) + P_\mu(B_+) \le 2 \exp\left( -\frac{\delta^2}{3(1 + \delta)} m \right) \]

This concludes the lemma. □ 
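The coupling used in the proof of Lemma 3 is easy to simulate. The sketch below is illustrative only: it assumes $p_\pm = m / ((1 \mp \delta) N)$, which is the choice that the Chernoff step of the proof implies, and the function names and the values of `N`, `m`, `delta` are hypothetical.

```python
import math
import random

def couple_once(N, m, delta, rng):
    """Generate X ~ Samp(m, A) and Z^-, Z^+ ~ Flip(p_-, A), Flip(p_+, A)
    from the same N uniform random variables."""
    u = [rng.random() for _ in range(N)]
    order = sorted(range(N), key=lambda i: u[i])
    x = set(order[:m])                      # indices of the m smallest U_i
    p_minus = m / ((1 + delta) * N)         # assumed form of p_-
    p_plus = m / ((1 - delta) * N)          # assumed form of p_+
    z_minus = {i for i in range(N) if u[i] <= p_minus}
    z_plus = {i for i in range(N) if u[i] <= p_plus}
    return x, z_minus, z_plus

def sandwich_rate(N, m, delta, trials, seed=0):
    """Empirical frequency of the event Z^- <= X <= Z^+ under the coupling."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        x, z_minus, z_plus = couple_once(N, m, delta, rng)
        nested = z_minus <= x <= z_plus                 # subset tests
        by_size = len(z_minus) <= len(x) <= len(z_plus)
        assert nested == by_size    # the equivalence used in the proof
        successes += nested
    return successes / trials

def chernoff_lower_bound(m, delta):
    """Right-hand side of (25)."""
    return 1 - 2 * math.exp(-delta ** 2 * m / (3 * (1 + delta)))

rate = sandwich_rate(N=1000, m=200, delta=0.2, trials=200)
print(rate, chernoff_lower_bound(200, 0.2))
```

Because all three sets are initial segments of the same ordering of the $U_i$, containment reduces to a comparison of sizes, which is what the inner assertion checks on every trial.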

Using this simple lemma and Theorem 2, we prove the sandwich theorem. 

8.2 Main proof. 

Recall that our aim is to prove that the uniform measure over the set $S$ is $(\epsilon, \delta)$-sandwichable by some product measure 
$G(n, Q)$. 

Proof of Theorem 1. Given a $\mathcal{P}$-symmetric convex set $S$, consider $m^*(S)$ the optimal edge-profile and define the 
$n \times n$ matrix $Q^*(S)$ as: $Q^*_{u,v} = m_i^*/p_i$ if $\{u, v\} \in P_i$ and $i \in [k]$. Further, define $q_i^\pm := (1 \pm \epsilon)\, m_i^*/p_i$, $\forall i \in [k]$, to be used later. 

In order to prove the required statement, we need to construct a coupling between the random variables $G \sim U(S)$, 
$G^\pm \sim G(n, (1 \pm \epsilon) Q^*)$. By separating edges according to the partition, we express the edge set of the graphs as 
$E(G) = E_1 \cup \ldots \cup E_k$ and $E(G^\pm) = E_1^\pm \cup \ldots \cup E_k^\pm$. 

Let $\mu$ denote the joint probability distribution of $N + 1$ i.i.d. uniform random variables $U_1, \ldots, U_{N+1}$ on $[0,1]$. 
As in the coupling lemma, we are going to use these random variables to jointly generate the random edge-sets of 
$G^-, G, G^+$. Using $U_{N+1}$, we can first generate the edge profile $v \in m(S)$ from its corresponding distribution. Then, 
conditional on the edge profile $v \in \mathbb{N}^k$, the probability distribution of $G$ factorizes into $G(n, m)$-like distributions for 
each block (Section 9). Lastly, we associate with each edge $e$ a unique random variable $U_e$ and construct a coupling 
for edges in each block separately. 

In our notation, $E_i \sim \mathrm{Samp}(v_i, P_i)$ and $E_i^\pm \sim \mathrm{Flip}(q_i^\pm, P_i)$. Using Lemma 3 we construct a coupling for each 
$i \in [k]$ between the random variables $E_i, E_i^+, E_i^-$ and bound the probability that the event $E_i^- \subseteq E_i \subseteq E_i^+$ does not 
hold. Using the union bound over the $k$ parts, we then obtain an estimate of the probability that the property holds 
across blocks, always conditional on the edge-profile $v$. The final step involves getting rid of the conditioning by 
invoking the concentration theorem. 

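Concretely, define $B_i$ the event that the $i$-th block does not satisfy the property $E_i^- \subseteq E_i \subseteq E_i^+$ and $\mathcal{L}_\epsilon(m^*)$ the 
set appearing in Theorem 3. We have that $P_\mu(G^- \subseteq G \subseteq G^+) = 1 - P_\mu(\cup_i B_i)$. Conditioning on the edge profile 
gives: 

\[ P_\mu(\cup_i B_i) \le P_\mu(\mathcal{L}^c_\epsilon(m^*)) + \sum_{v \in \mathcal{L}_\epsilon(m^*)} P_\mu(\cup_i B_i \mid v)\, P_\mu(v) \]

\[ \le P_\mu(\mathcal{L}^c_\epsilon(m^*)) + \max_{v \in \mathcal{L}_\epsilon(m^*)} P_\mu(\cup_i B_i \mid v) \]

\[ \le P_\mu(\mathcal{L}^c_\epsilon(m^*)) + \max_{v \in \mathcal{L}_\epsilon(m^*)} \sum_{i=1}^{k} P_\mu(B_i \mid v) \]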




The first inequality holds by conditioning on the edge profile and bounding the probability of the bad events from above 
by 1 for all “bad” profiles (outside of the concentration set). The second inequality is derived by upper bounding the 
probability of the bad event by that of the least favorable such edge-profile, and the last inequality follows from an application 
of the union bound. Applying the concentration theorem we get a bound on the first term, and then invoking Lemma 3 we get a bound 
for each of the terms in the sum: 

\[ P_\mu(\cup_i B_i) \le \exp\left[ -\mu(S) \left( \frac{\epsilon^2}{1 + \epsilon} - \lambda(S) \right) \right] + 2 \max_{v \in \mathcal{L}_\epsilon(m^*)} \sum_{i=1}^{k} \exp\left[ -\frac{\epsilon^2}{3(1 + \epsilon)} v_i \right] \]


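Hence, we see that the upper bound is monotone in $v_i$ for all $i \in [k]$. Additionally, we know that for all $v \in \mathcal{L}_\epsilon(m^*)$ 
it holds that $v \ge (1 - \epsilon) m^*$. Further, by definition we have $m_i^* \ge \mu(S)$. The bound now becomes: 

\[ P_\mu(\cup_i B_i) \le \exp\left[ -\mu(S) \left( \frac{\epsilon^2}{1 + \epsilon} - \lambda(S) \right) \right] + 2k \exp\left[ -\frac{\epsilon^2 (1 - \epsilon)}{3(1 + \epsilon)} \mu(S) \right] \]

\[ \le \exp\left[ -\mu(S) \left( \frac{\epsilon^2}{1 + \epsilon} - \lambda(S) \right) \right] + \exp\left[ -\mu(S) \left( \frac{\epsilon^2 (1 - \epsilon)}{3(1 + \epsilon)} - \frac{\log(2k)}{\mu(S)} \right) \right] \]

Finally, using $\epsilon < 1/2$ and $\log(2k)/\mu(S) < \lambda(S)$ we arrive at the required conclusion. □ 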


9 Supplementary Proofs 


Proof of Proposition 1. Fix an $x \in H_n$ and consider the set $O(x) = \{y \in H_n : \exists \pi \in \Pi_n(\mathcal{P}) \text{ such that } y = \pi(x)\}$ 
and call it the orbit of $x$ under $\Pi_n(\mathcal{P})$ (note that by the group property orbits form a partition of $H_n$). The assumption of 
symmetry implies that $f$ is constant for all $y \in O(x)$: 

\[ f(y_1) = f(y_2) = f(x), \quad \forall y_1, y_2 \in O(x) \]

By definition of $\Pi_n(\mathcal{P})$, for any $x \in H_n$ there is a permutation $\pi_x \in \Pi_n(\mathcal{P})$ such that i) $\pi_x(x) = (\pi_{x,1}(x_{P_1}), \ldots, \pi_{x,k}(x_{P_k})) \in O(x)$, 
ii) for all $i \in [k]$, $\pi_{x,i}(x_{P_i})$ is a bit-string where all 1's appear consecutively starting from the first position. Let 
us identify with each orbit $O \subseteq H_n$ such a distinct element $x_O$. As the function $f$ is constant along each orbit, its 
value depends only on $x_O$, which in turn depends only on the number of 1's (edges) in each part, encoded in the 
edge profile $m = (m_1, \ldots, m_k)$. □ 


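Proof of Proposition 2. Consider $\mathcal{P}$ any balanced partition consisting of $k$ parts and let $S = \mathcal{G}_n$ be the space of 
all graphs. Then $Z(S) = |S| = 2^{\binom{n}{2}}$ and $m^*(S)$ is the all-$\binom{n}{2}/2k$ vector (all blocks half full). Using Stirling's 
approximation of the factorial, and writing $p = \binom{n}{2}/k$ for the common size of the parts, we have that: 

\[ \log(|S|) - \mathrm{ENT}(m^*) = \binom{n}{2} \log 2 - k \log\binom{p}{p/2} \tag{26} \]

\[ \ge \frac{k}{2} \log\frac{\pi p}{2} \tag{27} \]

For $k = o(n^2)$ the last expression is of order $\Omega(k \log n)$. □ 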
Proposition 4. Consider for all $i \in [k]$ disjoint sets of edges $I_i, O_i \subseteq P_i$ and define the events $A_i = \{G \in \mathcal{G}_n : I_i \subseteq 
E(G) \text{ and } O_i \cap E(G) = \emptyset\}$. Conditional on the edge profile of $G$ being $v$, the events are independent, i.e. it holds 
that: $P_S(A_1 \cap \ldots \cap A_k \mid v) = \prod_{i=1}^{k} P_S(A_i \mid v)$. 

Proof of Proposition 4. Since $G \sim U(S)$ the distribution of $G$ is by definition uniform on $S$. This also means that it 
is uniform on the subset of graphs having edge profile $v \in \mathbb{N}^k$ (conditioning). But then: 

\[ P_S(A_1 \cap \ldots \cap A_k \mid v) = \frac{P_S(A_1 \cap \ldots \cap A_k \cap \{m(G) = v\})}{P_S(m(G) = v)} = \frac{|A_1 \cap \ldots \cap A_k \cap \{m(G) = v\}|}{|\{m(G) = v\}|} \]

where the first equality follows from Bayes' rule and the second from uniformity and the fact that our symmetry 
assumption implies that membership in $S$ depends only on the edge-profile $m$. Recall that each set $A_i = \{G \in \mathcal{G}_n : 
I_i \subseteq E(G) \text{ and } O_i \cap E(G) = \emptyset\}$ imposes the requirement that the edges in $I_i$ are included in $G$ and that the edges 
in $O_i$ are not included in $G$. Having conditioned on $v$, we know that exactly $v_i$ edges from $P_i$ are included in $G$ and 
that we can satisfy the requirements for edges in $P_i$ by selecting any subset of $v_i - |I_i|$ edges out of $P_i \setminus (I_i \cup O_i)$. For 
convenience set $|P_i| = p_i$, $|I_i| = n_i$, $|O_i \cup I_i| = r_i$, and let $C^n_k$ denote the number of $k$-combinations out of an 
$n$ element set (binomial coefficient). The number of valid subsets of $P_i$ is then given by $C^{p_i - r_i}_{v_i - n_i}$. As the constraints 
imposed are separable, we have: 

\[ \frac{|A_1 \cap \ldots \cap A_k \cap \{m(G) = v\}|}{|\{m(G) = v\}|} = \frac{\prod_{i=1}^{k} C^{p_i - r_i}_{v_i - n_i}}{\prod_{i=1}^{k} C^{p_i}_{v_i}} = \prod_{i=1}^{k} \frac{|A_i \cap \{m(G) = v\}|}{|\{m(G) = v\}|} \]

which gives the required identity by exploiting again the uniformity of the probability measure. 


□ 
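The conditional independence stated in Proposition 4 can be verified by exhaustive enumeration on a toy instance. The sketch below is an illustration under assumptions: the partition `PARTS`, the membership rule `in_S` (which depends only on the edge-profile, as symmetry requires) and the events `A1`, `A2` are all hypothetical choices.

```python
from fractions import Fraction
from itertools import product

# Hypothetical partition of 6 potential edges into k = 2 parts.
PARTS = [(0, 1, 2), (3, 4, 5)]

def profile(x):
    """Edge-profile m(x): number of present edges in each part."""
    return tuple(sum(x[e] for e in part) for part in PARTS)

def in_S(x):
    """A symmetric set: membership depends on the edge-profile only."""
    m1, m2 = profile(x)
    return m1 + 2 * m2 <= 5

def event(include, exclude):
    """A_i: edges in `include` are present and edges in `exclude` are absent."""
    return lambda x: all(x[e] == 1 for e in include) and all(x[e] == 0 for e in exclude)

def cond_prob(pred, v):
    """P_S(pred | m(G) = v) under the uniform measure on S, by enumeration."""
    fiber = [x for x in product((0, 1), repeat=6) if in_S(x) and profile(x) == v]
    return Fraction(sum(1 for x in fiber if pred(x)), len(fiber))

A1 = event(include=(0,), exclude=(1,))   # constrains part 1 only
A2 = event(include=(), exclude=(3,))     # constrains part 2 only
v = (2, 1)

joint = cond_prob(lambda x: A1(x) and A2(x), v)
print(joint, cond_prob(A1, v) * cond_prob(A2, v))
```

Exact rational arithmetic is used so that the product identity can be compared without floating-point tolerance.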